Abstract



We present a general approach for collaborative filtering (CF) using spectral regularization
to learn linear operators from "users" to a set of possibly desired "objects". Recent low-rank
type matrix completion approaches to CF are shown to be special cases. However,
unlike existing regularization based CF methods, our approach can also incorporate
information such as attributes of the users or the objects. We provide novel representer
theorems that we use to develop new estimation methods. We then provide learning
algorithms based on low-rank decompositions, and test them on a standard CF dataset.
The experiments indicate the advantages of generalizing the existing regularization based
CF methods to incorporate related information about users and objects. Finally, we show
that certain multi-task learning methods can also be seen as special cases of our proposed
approach.



1. Introduction 

Collaborative filtering (CF) refers to the task of predicting preferences of a given "user" 
for some "objects" (e.g., books, music, products, people, etc.) based on his/her previously 
revealed preferences — typically in the form of purchases or ratings — as well as the revealed 
preferences of other users. In a book recommender system, for example, one would like to 
suggest new books to someone based on what she and other users have recently purchased
or rated. The ultimate goal of CF is to infer the preferences of users in order to offer them 
new objects. 



A number of CF methods have been developed in the past (Breese et al., 1998; Heckerman et al.,
2000; Salakhutdinov et al., 2007). Recently there has been interest in CF using regularization
based methods (Srebro and Jaakkola, 2003). This work adds to that literature by
developing a novel general approach to regularization based CF methods.

Recent regularization based CF methods assume that the only data available are the re- 
vealed preferences, where no other information such as background information on the 
objects or users is given. In this case, one may formulate the problem as that of inferring 
the contents of a partially observed preference matrix: each row represents a user, each 
column represents an object (e.g., books or movies), and entries in the matrix represent 
a given user's rating of a given object. When the only information available is a set of 
observed user/object ratings, the unknown entries in the matrix must be inferred from the 
known ones - of which there are typically very few relative to the size of the matrix. 

To make useful predictions within this setting, regularization based CF methods make 
certain assumptions about the relatedness of the objects and users. The most common 
assumption is that preferences can be decomposed into a small number of factors, both for 
users and objects, resulting in the search for a low-rank matrix which approximates the
partially observed matrix of preferences (Srebro and Jaakkola, 2003). The rank constraint
can be interpreted as a regularization on the hypothesis space. Since the rank constraint
gives rise to a non-convex set of matrices, the associated optimization problem will be a
difficult non-convex problem for which only heuristic algorithms exist (Srebro and Jaakkola,
2003). An alternative formulation, proposed by Srebro et al. (2005), suggests penalizing the
predicted matrix by its trace norm, i.e., the sum of its singular values. An added benefit
of the trace norm regularization is that, with a sufficiently large regularization parameter,
the final solution will be low-rank (Fazel et al., 2001; Bach, 2008).



However, a key limitation of current regularization based CF methods is that they do not 
take advantage of information, such as attributes of users (e.g., gender, age) or objects 
(e.g., book's author, genre), which is often available. Intuitively, such information might 
be useful to guide the inference of preferences, in particular for users and objects with very 
few known ratings. For example, at the extreme, users and objects with no prior ratings
cannot be considered in the standard CF formulation, while their attributes alone could
provide some basic preference inference.

The main contribution of this paper is to develop a general framework and specific
algorithms, also based on novel representer theorems, for the more general CF setting where other
information, such as attributes for users and/or objects, may be available. More precisely,
we show that CF, while typically seen as a problem of matrix completion, can be thought of 
more generally as estimating a linear operator from the space of users to the space of objects. 
Equivalently, this can be viewed as learning a bilinear form between users and objects. We 
then develop spectral regularization based methods to learn such linear operators. When 
dealing with operators, rather than matrices, one may also work in infinite dimensions,
allowing one to consider arbitrary feature spaces, possibly induced by some kernel function.
Among the key theoretical contributions of this paper are new representer theorems, allowing
us to develop new general methods that learn finitely many parameters even when working
in infinite dimensional user/object feature spaces. These representer theorems generalize
the classical representer theorem for minimization of an empirical loss penalized by the 
norm in a Reproducing Kernel Hilbert Space (RKHS) to more general penalty functions 
and function classes. 

We also show that, with the appropriate choice of kernels for both users and objects, we 
may consider a number of existing machine learning methods as special cases of our general 
framework. In particular, we show that several CF methods such as rank constrained 
optimization, trace-norm regularization, and those based on Frobenius norm regularization,
can all be cast as special cases of spectral regularization on operator spaces. Moreover, 
particular choices of kernels lead to specific sub-cases such as regular matrix completion and 
multitask learning. In the specific application of collaborative filtering with the presence 
of attributes, we show that our generalization of these sub-cases leads to better predictive 
performance. 

The outline of the paper is as follows. In Section 2 we review the notion of a compact
operator on a Hilbert space, and we show how to cast the collaborative filtering problem
within this framework. We then introduce spectral regularization and discuss how rank
constraint, trace norm regularization, and Frobenius norm regularization are all special cases
of spectral regularization. In Section 3 we show how our general framework encompasses
many existing methods by proper choices of the loss function, the kernels, and the spectral
regularizer. In Section 4 we provide three representer theorems for operator estimation with
spectral regularization which allow for efficient learning algorithms. Finally in Section 5 we
present a number of algorithms and describe several techniques to improve efficiency. We
test these algorithms in Section 6 on synthetic examples and a widely used movie database.

2. Learning compact operators with spectral regularization 

In this section we propose a mathematical formulation for a general CF problem with 
spectral regularization. We then show in Section 3 how several learning problems can be
cast under this general framework. 

2.1 A general CF formulation 

We consider a general CF problem in which our goal is to model the preference of a user 
described by x for an item described by y. We denote by x and y the data objects containing 
all relevant or available information; this could, for example, include a unique identifier i for 
the i-th user or object. Of course, the users and objects may additionally be characterized by 
attributes, in which case x or y would contain some representation of this extra information. 
Ultimately, we would like to consider such attribute information as encoded in some positive 
definite kernel between users, or equivalently between objects. This naturally leads us to 
model the users as elements in a Hilbert space $\mathcal{X}$, and the objects they rate as elements of
another Hilbert space $\mathcal{Y}$.

We assume that our observation data is in the form of ratings from users to objects, a 
real-valued score representing the user's preference for the object. Alternatively, similar 
methods can be applied when the observations are binary, specifying for instance whether
or not a user considered or selected an object. 

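Given a series of $N$ observations $(\mathbf{x}_i, \mathbf{y}_i, t_i)_{i=1,\ldots,N}$ in $\mathcal{X} \times \mathcal{Y} \times \mathbb{R}$, where $t_i$ represents the rating
of user $\mathbf{x}_i$ for object $\mathbf{y}_i$, the generalized CF problem is then to infer a function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$
that can then be used to predict the rating of any user $\mathbf{x} \in \mathcal{X}$ for any object $\mathbf{y} \in \mathcal{Y}$ by
$f(\mathbf{x}, \mathbf{y})$. Note that in our notation, $\mathbf{x}_i$ and $\mathbf{y}_i$ represent the user and object corresponding
to the $i$-th rating available. If several ratings of a user for different objects are available, as
is commonly the case, several $\mathbf{x}_i$'s will be identical in $\mathcal{X}$, a slight abuse of notation. We
denote by $\mathcal{X}_N$ and $\mathcal{Y}_N$ the linear spans of $\{\mathbf{x}_i, i = 1, \ldots, N\}$ and $\{\mathbf{y}_i, i = 1, \ldots, N\}$ in $\mathcal{X}$
and $\mathcal{Y}$, with respective dimensions $m_{\mathcal{X}}$ and $m_{\mathcal{Y}}$.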

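For the function to be estimated we restrict ourselves to bilinear forms given by:

$$f(\mathbf{x}, \mathbf{y}) = \langle \mathbf{x}, F \mathbf{y} \rangle_{\mathcal{X}} , \qquad (1)$$

for some compact operator $F$. We now denote by $\mathcal{B}_0(\mathcal{Y}, \mathcal{X})$ the set of compact
operators from $\mathcal{Y}$ to $\mathcal{X}$. For an introduction to relevant concepts in functional analysis, see
Appendix A.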
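
In finite dimensions the bilinear form (1) reduces to a matrix product. The following minimal sketch assumes that users and objects are represented by vectors in $\mathbb{R}^p$ and $\mathbb{R}^q$; the dimensions, the random data and the name `bilinear_rating` are illustrative choices, not part of the paper. It also checks that the prediction equals the Frobenius inner product between $F$ and the rank-one matrix $\mathbf{x}\mathbf{y}^\top$.

```python
import numpy as np

def bilinear_rating(x, F, y):
    """Predicted rating f(x, y) = <x, F y> for finite-dimensional x and y."""
    return float(x @ F @ y)

rng = np.random.default_rng(0)
p, q = 4, 3                      # illustrative feature dimensions (assumption)
F = rng.standard_normal((p, q))  # matrix of an operator from R^q (objects) to R^p (users)
x = rng.standard_normal(p)       # one user
y = rng.standard_normal(q)       # one object

rating = bilinear_rating(x, F, y)
# Same value written as an inner product between F and the rank-one matrix x y^T.
rating_as_inner_product = float(np.sum(F * np.outer(x, y)))
```

The second expression shows that a prediction depends on $F$ only through its inner product with $\mathbf{x} \otimes \mathbf{y}$.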

In the general case we consider below, if $\mathcal{X}$ and $\mathcal{Y}$ are not Hilbert spaces, one could also first
map (implicitly) users $\mathbf{x}$ and objects $\mathbf{y}$ to points $\Phi_{\mathcal{X}}(\mathbf{x})$ and $\Phi_{\mathcal{Y}}(\mathbf{y})$ of possibly infinite
dimensional Hilbert feature spaces, and use kernels. We refer the reader to Appendix A for basic
definitions and properties related to compact operators that are useful below. The inference
problem can now be stated as follows:

Given a training set of ratings, how may we estimate a "good" compact operator $F$ to predict
future ratings using (1)?

We estimate the operator $F$ in (1) from the training data using a standard regularization and
statistical machine learning approach. In particular, we propose to define the operator as
the solution of an optimization problem over $\mathcal{B}_0(\mathcal{Y}, \mathcal{X})$ whose objective function balances a
data fitting term $R_N(F)$, which is small for operators that can correctly explain the training
data, with a regularization term $\Omega(F)$. We now describe these two terms in more detail.

2.2 Data fitting term 

Given a loss function $\ell(t', t)$ that quantifies how good a prediction $t' \in \mathbb{R}$ is if the true value
is $t \in \mathbb{R}$, we consider a fitting term equal to the empirical risk, i.e., the mean loss incurred
on the training set:

$$R_N(F) = \frac{1}{N} \sum_{i=1}^{N} \ell\left( \langle \mathbf{x}_i, F \mathbf{y}_i \rangle_{\mathcal{X}}, t_i \right) . \qquad (2)$$

The particular choice of the loss function should typically depend on the precise problem
to be solved and on the nature of the variables $t$ to be predicted. See more details in
Section 3. In particular, while the representer theorems presented in Section 4 do not need
any convexity with respect to this choice, the algorithms presented in Section 5 do.
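
As an illustration, the empirical risk can be evaluated as follows for finite-dimensional users and objects and the square loss. This is a sketch under stated assumptions: the array layout (one user or object per row), the synthetic data and the function names are ours.

```python
import numpy as np

def square_loss(pred, target):
    """Square loss l(t', t) = (t' - t)^2."""
    return (pred - target) ** 2

def empirical_risk(F, X, Y, t, loss=square_loss):
    """Mean loss over the N ratings.

    X has shape (N, p) with user x_i in row i, Y has shape (N, q) with
    object y_i in row i, and t has shape (N,)."""
    preds = np.einsum("ip,pq,iq->i", X, F, Y)  # <x_i, F y_i> for every i
    return float(np.mean(loss(preds, t)))

rng = np.random.default_rng(0)
N, p, q = 20, 4, 3
X = rng.standard_normal((N, p))
Y = rng.standard_normal((N, q))
F_true = rng.standard_normal((p, q))
t = np.einsum("ip,pq,iq->i", X, F_true, Y)     # noiseless synthetic ratings

risk_at_truth = empirical_risk(F_true, X, Y, t)
risk_at_zero = empirical_risk(np.zeros((p, q)), X, Y, t)
```

With noiseless ratings the risk vanishes at the generating operator, while the zero operator incurs the mean of the squared ratings.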







2.3 Regularization term 

For the regularization term, we focus on a class of spectral functions defined as follows. 

Definition 1 A function $\Omega : \mathcal{B}_0(\mathcal{Y}, \mathcal{X}) \to \mathbb{R} \cup \{+\infty\}$ is called a spectral penalty function
if it can be written as:

$$\Omega(F) = \sum_{i=1}^{d} s_i(\sigma_i(F)) , \qquad (3)$$

where for any $i \geq 1$, $s_i : \mathbb{R}^+ \to \mathbb{R}^+ \cup \{+\infty\}$ is a non-decreasing penalty function satisfying
$s_i(0) = 0$, and $(\sigma_i(F))_{i=1,\ldots,d}$ are the $d$ singular values of $F$ in decreasing order, with $d$ possibly
infinite.

Note that by the spectral theorem presented in Appendix A, any compact operator can be
decomposed into singular vectors, with singular values being a sequence that tends to zero. 

Spectral penalty functions include as special cases several functions often encountered in 
matrix completion problems: 



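• For a given integer $r$, taking $s_i = 0$ for $i = 1, \ldots, r$ and $s_{r+1}(u) = +\infty$ if $u > 0$, leads
to the function:

$$\Omega(F) = \begin{cases} 0 & \text{if } \operatorname{rank}(F) \leq r, \\ +\infty & \text{otherwise.} \end{cases} \qquad (4)$$

In other words, the set of operators $F$ that satisfy $\Omega(F) < +\infty$ is the set of operators
with rank at most $r$.

• Taking $s_i(u) = u$ for all $i$ results in the trace norm penalty (see Appendix A):

$$\Omega(F) = \begin{cases} \| F \|_* & \text{if } F \in \mathcal{B}_1(\mathcal{Y}, \mathcal{X}), \\ +\infty & \text{otherwise,} \end{cases} \qquad (5)$$

where we denote by $\mathcal{B}_1(\mathcal{Y}, \mathcal{X})$ the set of operators with finite trace norm. Such
operators are referred to as trace class operators.

• Taking $s_i(u) = u^2$ for all $i$ results in the squared Hilbert-Schmidt norm penalty (also
called squared Frobenius norm for matrices, see Appendix A):

$$\Omega(F) = \begin{cases} \| F \|_{\mathrm{Fro}}^2 & \text{if } F \in \mathcal{B}_2(\mathcal{Y}, \mathcal{X}), \\ +\infty & \text{otherwise,} \end{cases} \qquad (6)$$

where we denote by $\mathcal{B}_2(\mathcal{Y}, \mathcal{X})$ the set of operators with finite squared Hilbert-Schmidt
norm. Such operators are referred to as Hilbert-Schmidt operators.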



These particular functions can be combined together in different ways. For example, we
may constrain the rank to be at most $r$ while penalizing the trace norm of the matrix,
which can be obtained by setting $s_i(u) = u$ for $i = 1, \ldots, r$ and $s_{r+1}(u) = +\infty$ if $u > 0$.
Alternatively, if we want to penalize the Frobenius norm while constraining the rank, we
set $s_i(u) = u^2$ for $i = 1, \ldots, r$ and $s_{r+1}(u) = +\infty$ if $u > 0$. We state these two choices
of $\Omega$ explicitly since we use these in the experiments (see Section 6) or to design efficient
algorithms (see Section 5):



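$$\text{Trace+Rank Penalty:} \quad \Omega(F) = \begin{cases} \| F \|_* & \text{if } \operatorname{rank}(F) \leq r, \\ +\infty & \text{otherwise.} \end{cases} \qquad (7)$$

$$\text{Frobenius+Rank Penalty:} \quad \Omega(F) = \begin{cases} \| F \|_{\mathrm{Fro}}^2 & \text{if } \operatorname{rank}(F) \leq r, \\ +\infty & \text{otherwise.} \end{cases} \qquad (8)$$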
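
For matrices, every penalty of this section can be computed from the singular values alone. The sketch below is illustrative rather than the implementation used in the experiments: the function names and the numerical tolerance `tol` are our assumptions, and penalties $s_i$ beyond index $r+1$ are taken to be infinite on nonzero values, which does not change $\Omega$ because the singular values are sorted.

```python
import numpy as np

def spectral_penalty(F, s, tol=1e-10):
    """Omega(F) = sum_i s(i, sigma_i(F)), singular values in decreasing order."""
    sigma = np.linalg.svd(F, compute_uv=False)   # returned in decreasing order
    sigma = np.where(sigma > tol, sigma, 0.0)    # treat tiny values as exact zeros
    return float(sum(s(i, u) for i, u in enumerate(sigma, start=1)))

def rank_constraint(r):
    # 0 on the first r singular values, +inf on any later nonzero one.
    return lambda i, u: 0.0 if (i <= r or u == 0.0) else np.inf

def trace_norm(i, u):
    return u

def squared_frobenius(i, u):
    return u ** 2

def trace_plus_rank(r):
    return lambda i, u: u if i <= r else (0.0 if u == 0.0 else np.inf)

def frobenius_plus_rank(r):
    return lambda i, u: u ** 2 if i <= r else (0.0 if u == 0.0 else np.inf)

F = np.diag([3.0, 2.0, 0.0])   # singular values 3, 2, 0, hence rank 2
```

For this diagonal matrix the trace norm is $3 + 2 = 5$ and the squared Frobenius norm is $9 + 4 = 13$; the rank constraint is satisfied for $r = 2$ and violated for $r = 1$.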



2.4 Operator inference 



With both a fitting term and a regularization term, we can now formally define our inference 
approach. It consists of finding an operator F, if there exists one, that solves the following 
optimization problem: 

$$\hat{F} \in \arg\min_{F \in \mathcal{B}_0(\mathcal{Y}, \mathcal{X})} R_N(F) + \lambda \Omega(F) , \qquad (9)$$

where $\lambda \in \mathbb{R}^+$ is a parameter that controls the trade-off between fitting and regularization,
and where $R_N(F)$ and $\Omega(F)$ are respectively defined in (2) and (3). We note that if the
set $\{F \in \mathcal{B}_0(\mathcal{Y}, \mathcal{X}) : \Omega(F) < +\infty\}$ is not empty, then necessarily the solution $\hat{F}$ of this
optimization problem must satisfy $\Omega(\hat{F}) < \infty$.

We show in Sections 4 and 5 how problem (9) can be solved in practice, in particular for
Hilbert spaces of infinite dimensions. Before exploring such implementation-related issues,
in the following section we provide several examples of algorithms that can be derived as
particular cases of (9) and highlight their relationships to existing methods.



3. Examples and related approaches 

The general formulation (9) can result in a variety of practical algorithms potentially useful
in different contexts. In particular, three elements can be tailored to one's particular needs:
the loss function, the kernels (or equivalently the Hilbert spaces), and the spectral penalty
term. We start this section with some generalities about the possible choices for these elements
and their consequences, before highlighting some particular combinations of choices relevant
for different applications.

1. The loss function. The choice of $\ell$ defines the empirical risk through (2). It is a
classical component of many machine learning methods, and should typically depend
on the type of data to be predicted (e.g., discrete or continuous) and on the final
objective of the algorithm (e.g., classification, regression or ranking). The choice of $\ell$
also influences the algorithm, as discussed in Section 5. As a deeper discussion about
the loss function is only tangential to the current work, we only consider the square
loss here, knowing that other convex losses may be considered.

2. The spectral penalty function. The choice of $\Omega(F)$ defines the type of constraint
we impose on the operator that we seek to learn. In Section 2.3 we gave several
examples of such constraints including the rank constraint (4), the trace norm
constraint (5), the Hilbert-Schmidt norm constraint (6), or the trace norm constraint over
low-rank operators (7). The choice of a particular penalty might be guided by some
considerations about the problem to be solved, e.g., finding low-rank operators as a
way to discover low-dimensional latent structures in the data. On the other hand, from
an algorithmic perspective, the choice of the spectral penalty may affect the efficiency
or feasibility of our learning algorithm. Certain penalty functions, such as the rank
constraint for example, will lead to non-convex problems because the corresponding
penalty function (4) is not convex itself. However, the same rank constraint can vastly
reduce the number of parameters to be learned. These algorithmic considerations are
discussed in more detail in Section 5.

3. The kernels. Our choice of kernels defines the inner products (i.e., embeddings)
of the users and objects in their respective Hilbert spaces. We may use a variety of
possible kernels depending on the problem to be solved and on the attributes available.
Interestingly, the choice of a particular kernel has no influence on the algorithm, as we
show later (however, it does of course influence the running time of these algorithms).
In the current work, we focus on two basic kernels (Dirac kernels and attribute kernels)
and in Section 3.4 we discuss combining these.

• The first kernel we consider is the Dirac kernel. When two users (resp. two
objects) are compared, the Dirac kernel returns 1 if they are the same user
(resp. object), and 0 otherwise. In other words, the Dirac kernel amounts to
representing the users (resp. the objects) by orthonormal vectors in $\mathcal{X}$ (resp. in
$\mathcal{Y}$). This kernel can be used whether or not attributes are available for users and
objects. We denote by $k_D^{\mathcal{X}}$ (resp. $k_D^{\mathcal{Y}}$) the Dirac kernel for the users (resp. objects).

• The second kernel we consider is a kernel between attributes, when attributes
are available to describe the users and/or objects. We call this an "attribute
kernel". This would typically be a kernel between vectors, such as the inner
product or a Gaussian RBF kernel, when the descriptions of users and/or objects
take the form of vectors of real-valued attributes, or any kernel on structured
objects (Shawe-Taylor and Cristianini, 2004). We denote by $k_A^{\mathcal{X}}$ (resp. $k_A^{\mathcal{Y}}$) the
attributes kernel for the users (resp. objects).
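
The two basic kernels can be sketched as Gram matrix computations. In the example below the user identifiers, the attribute vectors (age and a binary gender code) and the bandwidth `gamma` are made-up values, and the Gaussian RBF kernel is only one possible attribute kernel.

```python
import numpy as np

def dirac_gram(ids_a, ids_b):
    """Dirac kernel: 1 if the two identifiers are equal, 0 otherwise."""
    ids_a = np.asarray(ids_a)
    ids_b = np.asarray(ids_b)
    return (ids_a[:, None] == ids_b[None, :]).astype(float)

def rbf_gram(A, B, gamma=1.0):
    """Gaussian RBF attribute kernel exp(-gamma * ||a - b||^2) between rows."""
    sq_dist = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))  # clip rounding noise below zero

user_ids = [0, 1, 0]                                    # ratings 1 and 3 come from the same user
user_attrs = np.array([[25.0, 1.0], [40.0, 0.0], [25.0, 1.0]])

K_dirac = dirac_gram(user_ids, user_ids)
K_attr = rbf_gram(user_attrs, user_attrs, gamma=0.01)
```

The Dirac Gram matrix only records which ratings share a user, whereas the attribute kernel assigns a graded similarity to distinct users.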



In the following section we illustrate how specific combinations of loss, spectral penalty 
and kernels can be relevant for various settings. In particular the choice of kernels leads 
to new methods for a range of different estimation problems; namely, matrix completion, 
multi-task learning, and pairwise learning. In Section 3.4 we consider a new representation
that allows interpolation between these particular problem formulations. 



3.1 Matrix completion 

When the Dirac kernel is used for both users and objects, then we can organize the data
$\{\mathbf{x}_i, i = 1, \ldots, N\}$ into $n_{\mathcal{X}}$ groups of identical data points and similarly $\{\mathbf{y}_i, i = 1, \ldots, N\}$
into $n_{\mathcal{Y}}$ groups. Since we use the Dirac kernel, we can represent each of these groups
by the elements of the canonical bases $(\mathbf{u}_1, \ldots, \mathbf{u}_{n_{\mathcal{X}}})$ and $(\mathbf{v}_1, \ldots, \mathbf{v}_{n_{\mathcal{Y}}})$ of $\mathbb{R}^{n_{\mathcal{X}}}$ and $\mathbb{R}^{n_{\mathcal{Y}}}$,
respectively. A bilinear form using Dirac kernels only depends on the identities of the users
and the objects, and we only predict the rating $t_i$ based on the identities of the groups in
both spaces. If we assume that each user/object pair is observed at most once, the data can
be re-arranged into an $n_{\mathcal{X}} \times n_{\mathcal{Y}}$ incomplete matrix, the learning objective being to complete
this matrix (indeed, in this context, it is not possible to generalize to never seen points in
$\mathcal{X}$ and $\mathcal{Y}$).

In this case, our bilinear form framework exactly corresponds to completing the matrix,
since the bilinear function of $\mathbf{x}$ and $\mathbf{y}$ is exactly equal to $\mathbf{u}_i^\top M \mathbf{v}_j$ where $\mathbf{x} = \mathbf{u}_i$ (i.e., $\mathbf{x}$ is the
$i$-th person) and $\mathbf{y} = \mathbf{v}_j$ (i.e., $\mathbf{y}$ is the $j$-th object). Thus, the $(i, j)$-th entry of the matrix
$M$ can be identified with the value of the bilinear form defined by the matrix $M$ over the
pair $(\mathbf{u}_i, \mathbf{v}_j)$. Moreover, the spectral regularizer is the corresponding spectral
function of the complete matrix $M \in \mathbb{R}^{n_{\mathcal{X}} \times n_{\mathcal{Y}}}$.
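
The correspondence with matrix completion can be checked on a toy example. In this sketch the matrix sizes, the rating triples and the candidate matrix `M` are made up for illustration.

```python
import numpy as np

n_users, n_objects = 3, 4
U = np.eye(n_users)      # u_i = U[i], canonical basis for the users
V = np.eye(n_objects)    # v_j = V[j], canonical basis for the objects

# Observed ratings as (user index, object index, rating) triples (made-up data).
ratings = [(0, 1, 4.0), (2, 3, 1.0), (0, 2, 5.0)]

R = np.full((n_users, n_objects), np.nan)   # NaN marks unobserved entries
for user, obj, t in ratings:
    R[user, obj] = t

M = np.arange(12.0).reshape(n_users, n_objects)   # any candidate completed matrix
value = U[1] @ M @ V[2]                            # bilinear form at (u_1, v_2)

# One-hot vectors are orthonormal, so their Gram matrix is the Dirac kernel.
user_index = np.array([user for user, _, _ in ratings])
K = U[user_index] @ U[user_index].T
```

Here `value` coincides with the entry `M[1, 2]`, and `K` has a one exactly where two ratings come from the same user.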

In this context, finding a low-rank approximation of the observed entries in a matrix is an
appealing strategy, which corresponds to taking the rank penalty constraint (4) combined
with, for example, the square loss error. This however leads to non-convex optimization
problems with multiple local minima, for which only local search heuristics are known
(Srebro and Jaakkola, 2003). To circumvent this issue, convex spectral penalty functions
can be considered. Indeed, in the case of binary preferences, combining the hinge loss
function with the trace norm penalty (5) leads to the maximum margin matrix factorization
(MMMF) approach proposed by Srebro et al. (2005), which can be rewritten as a
semi-definite program. For the sake of efficiency, Rennie and Srebro (2005) proposed to add a
constraint on the rank of the matrix, resulting in a non-convex problem that can
nevertheless be handled efficiently by classical gradient descent techniques; in our setting, this
corresponds to replacing the trace norm penalty (5) by the penalty (7).
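
To illustrate how a convex trace norm penalty produces low-rank completions, the sketch below uses proximal gradient descent with singular value soft-thresholding. This is a standard generic technique that we substitute for illustration; it is neither the MMMF solver nor one of the algorithms of Section 5. The objective uses the square loss summed over observed entries with a factor 0.5, which differs from (9) by a rescaling, and all names are ours.

```python
import numpy as np

def soft_threshold_singular_values(A, tau):
    """Shrink every singular value of A by tau and clip at zero."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def complete_trace_norm(R, observed, lam, n_iter=200):
    """Proximal gradient for
        0.5 * sum over observed (i, j) of (M_ij - R_ij)^2 + lam * ||M||_*.
    `observed` is a boolean mask; entries of R outside the mask are ignored."""
    M = np.zeros(R.shape)
    for _ in range(n_iter):
        grad = np.where(observed, M - R, 0.0)              # gradient of the data term
        M = soft_threshold_singular_values(M - grad, lam)  # step size 1
    return M

a = np.array([1.0, 2.0, 2.0])
b = np.array([3.0, 4.0])
R = np.outer(a, b)                       # rank one, single singular value 3 * 5 = 15
full = np.ones(R.shape, dtype=bool)      # fully observed, so the solution is explicit
M_small = complete_trace_norm(R, full, lam=1.0)
M_large = complete_trace_norm(R, full, lam=20.0)
```

In this fully observed case the minimizer is the soft-thresholded matrix itself: with `lam=1.0` the singular value 15 shrinks to 14, and with `lam=20.0`, which exceeds the largest singular value, the solution is the zero matrix.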



3.2 Multi-task learning 

It may be the case that we have attributes only for objects $\mathbf{y}$ (we could do the same for
attributes for users). In that case, for a finite number of users $\{\mathbf{x}_i, i = 1, \ldots, N\}$ organized
in $n_{\mathcal{X}}$ groups, we aim to estimate a separate function on objects $f_i(\mathbf{y})$ for each of the $n_{\mathcal{X}}$
users $i$. Considering the estimation of each of these $f_i$'s as a learning task, one can possibly
learn all $f_i$'s simultaneously using a multi-task learning approach.

In order to adapt our general framework to this scenario, it is natural to consider the
attribute kernel $k_A^{\mathcal{Y}}$ for the objects, whose attributes are available, and the Dirac kernel $k_D^{\mathcal{X}}$
for the users, for which no attributes are used. Again the choice of the loss function depends
on the precise task to be solved, and the spectral penalty function can be tuned to enforce
some sharing of information between different tasks.

In particular, taking the rank penalty function (4) enforces a decomposition of the tasks
(learning each $f_i$) into a limited number of factors. This results in a method for multitask
learning based on a low-rank representation of the predictor functions $f_i$. The resulting
problem, however, is not convex due to the use of the non-convex rank penalty function.
A natural alternative is then to replace the rank constraint by the trace norm penalty
function (5), resulting in a convex optimization problem when the loss function is
convex. Recently, a similar approach was independently proposed by Amit et al. (2007) in the
context of multiclass classification and by Argyriou et al. (2008) for multitask learning.



Alternatively, another strategy to enforce some constraints among the tasks is to constrain
the variance of the different classifiers. Evgeniou et al. (2005) showed that this strategy
can be formulated in the framework of support vector machines by considering a multitask
kernel, i.e., a kernel $k_{\mathrm{multitask}}$ over the product space $\mathcal{X} \times \mathcal{Y}$ defined between any two
user/object pairs $(\mathbf{x}, \mathbf{y})$ and $(\mathbf{x}', \mathbf{y}')$ by:

$$k_{\mathrm{multitask}}\left( (\mathbf{x}, \mathbf{y}), (\mathbf{x}', \mathbf{y}') \right) = \left( k_D^{\mathcal{X}}(\mathbf{x}, \mathbf{x}') + c \right) k_A^{\mathcal{Y}}(\mathbf{y}, \mathbf{y}') , \qquad (10)$$

where $c > 0$ controls how the variance of the classifiers is constrained compared to the norm
of each classifier. As explained in Appendix A, estimating a function over the product space
$\mathcal{X} \times \mathcal{Y}$ by penalizing the RKHS norm of the kernel (10) is a particular case of our general
framework, where we take the Hilbert-Schmidt norm (6) as spectral penalty function, and
where the kernels between users and between objects are respectively $k_D^{\mathcal{X}}(\mathbf{x}, \mathbf{x}') + c$ and
$k_A^{\mathcal{Y}}(\mathbf{y}, \mathbf{y}')$. When $c = 0$, i.e., when we take a Dirac kernel for the users and an attribute kernel
for the objects, then penalizing the Hilbert-Schmidt norm amounts to estimating independent
models for each user, as explained in Evgeniou et al. (2005). Combining two Dirac kernels
for users and objects, respectively, and penalizing the Hilbert-Schmidt norm would not be
very interesting, since the solution would always be 0 except on the training pairs. On
the other hand, replacing the Hilbert-Schmidt norm by other penalties such as the
trace norm penalty (5) would be an interesting extension when the kernels $k_D^{\mathcal{X}}(\mathbf{x}, \mathbf{x}') + c$
and $k_A^{\mathcal{Y}}(\mathbf{y}, \mathbf{y}')$ are used: this would constrain both the variance of the predictor functions
$f_i$ and their decomposition into a small number of factors, which could be an interesting
approach in some multitask learning applications.
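
The effect of the constant $c$ in (10) is visible directly on the Gram matrix of the training pairs. The sketch below assumes a linear attribute kernel on the objects and uses made-up users and attributes; the function name is ours.

```python
import numpy as np

def multitask_gram(user_ids, object_attrs, c):
    """Gram matrix of (k_D(x, x') + c) * k_A(y, y') on N user/object pairs,
    with a linear attribute kernel on the objects (an illustrative choice)."""
    user_ids = np.asarray(user_ids)
    K_users = (user_ids[:, None] == user_ids[None, :]).astype(float) + c
    G_objects = object_attrs @ object_attrs.T
    return K_users * G_objects

user_ids = [0, 0, 1, 1]                                   # two ratings per user
object_attrs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]])

K_independent = multitask_gram(user_ids, object_attrs, c=0.0)
K_coupled = multitask_gram(user_ids, object_attrs, c=1.0)
```

With $c = 0$ the Gram matrix is block diagonal across users, so the ratings of one user carry no information about another and the tasks decouple; with $c > 0$ the off-diagonal blocks are nonzero and the tasks are coupled.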



3.3 Pairwise learning 

When attributes are available for both users and objects then it is possible to take the
attributes kernels for both of them. Combining this choice with the Hilbert-Schmidt penalty
(6) results in classical machine learning algorithms (e.g., an SVM if the hinge loss is taken
as the loss function) applied to the tensor product of $\mathcal{X}$ and $\mathcal{Y}$. This strategy is a classical
approach to learn a function over pairs of points (see, e.g., Jacob and Vert, 2008). Replacing
the Hilbert-Schmidt norm by another spectral penalty function, such as the trace norm,
would result in new algorithms for learning low-rank functions over pairs.



3.4 Combining the attribute and Dirac kernels 

As illustrated in the previous subsections, the setting of the application often determines 
the combination of kernels to be used for the users and the objects: typically, two Dirac 
kernels for the standard CF setting without attributes, one Dirac and one attributes kernel 
for multi-task problems, and two attributes kernels when attributes are available for both 
users and objects and one wishes to learn over pairs. 

There are many situations, however, where the attributes available to describe the users 
and/or objects are certainly useful for the inference task, but on the other hand do not fully 
characterize the users and/or objects. For example, if we just know the age and gender
of users, we would like to use this information to model their preferences, but would also 
like to allow different preferences for different users even when they share the same age and 
gender. In our setting, this means that we may want to use the attributes kernel in order 
to utilize known attributes from the users and objects during inference, but also the Dirac 
kernel to incorporate the fact that different users and/or objects remain different even when 
they share many or all of their attributes. 

This naturally leads us to consider the following convex combinations of Dirac and attributes
kernels (Abernethy et al., 2006):

$$\begin{aligned} k^{\mathcal{X}} &= \eta k_A^{\mathcal{X}} + (1 - \eta) k_D^{\mathcal{X}} , \\ k^{\mathcal{Y}} &= \zeta k_A^{\mathcal{Y}} + (1 - \zeta) k_D^{\mathcal{Y}} , \end{aligned} \qquad (11)$$

where $0 \leq \eta \leq 1$ and $0 \leq \zeta \leq 1$. These kernels interpolate between the Dirac kernels
($\eta = 0$ and $\zeta = 0$) and the attributes kernels ($\eta = 1$ and $\zeta = 1$). Combining this choice of
kernels with, e.g., the trace norm penalty function (5), allows us to continuously interpolate
between different settings corresponding to different "corners" in the $(\eta, \zeta)$ square: standard
CF with matrix completion in (0,0), multi-task learning in (0,1) and (1,0), and learning
over pairs in (1,1). The extra degree of freedom created when $\eta$ and $\zeta$ are allowed to vary
continuously between 0 and 1 provides a principled way to optimally balance the influence
of the attributes in the function estimation process.
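
On Gram matrices the interpolation is a plain convex combination. The sketch below uses three made-up users, two of whom share the same attributes, and a linear attribute kernel; the function name and data are illustrative assumptions.

```python
import numpy as np

def interpolate_kernels(K_attr, K_dirac, weight):
    """Convex combination weight * K_attr + (1 - weight) * K_dirac."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * K_attr + (1.0 - weight) * K_dirac

attrs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # users 0 and 1 share attributes
K_attr = attrs @ attrs.T       # linear attribute kernel
K_dirac = np.eye(3)            # three distinct users
K_half = interpolate_kernels(K_attr, K_dirac, 0.5)
```

At the intermediate weight, users 0 and 1 are similar (kernel value 0.5) without being identified with each other (their self-similarity is 1), which is the behaviour motivated in the text.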

Note that our representational framework encompasses simpler natural approaches to
include attribute information for collaborative filtering: for example, one could consider
completing matrices using matrices of the form $UV^\top + U_A R_A + S_A V_A^\top$, where $UV^\top$ is a low-rank
matrix to be optimized, $U_A$ and $V_A$ are the given attributes for the first and second domains,
and $R_A$, $S_A$ are parameters to be learned. This formulation corresponds to adding an
unconstrained low-rank term $UV^\top$ and the simpler linear predictor from the concatenation
of attributes $U_A R_A + S_A V_A^\top$ (Jacob and Vert, 2008). Our approach implicitly adds a fourth
cross-product term $U_A T V_A^\top$, where $T$ is estimated from data. This exactly corresponds
to imposing that the low rank matrix has a decomposition which includes $U_A$ and $V_A$ as
columns. Our combination of Dirac and attribute kernels has the advantage of having
specific weights $\eta$ and $\zeta$ that control the trade-off between the constrained and unconstrained
low-rank matrices.



4. Representer theorems 

We now present the key theoretical results of this paper and discuss how the general
optimization problem (9) can be solved in practice. A first difficulty with this problem is that the
optimization space $\{F \in \mathcal{B}_0(\mathcal{Y}, \mathcal{X}) : \Omega(F) < \infty\}$ can be of infinite dimension. We note that
this can occur even under a rank constraint, because the set $\{F \in \mathcal{B}_0(\mathcal{Y}, \mathcal{X}) : \operatorname{rank}(F) \leq R\}$
is not included in any finite-dimensional linear subspace if $\mathcal{X}$ and $\mathcal{Y}$ have infinite
dimensions. In this section, we show that the optimization problem (9) can be rephrased as a
finite-dimensional problem, and propose practical algorithms to solve it in Section 5. While
the reformulation of the problem as a finite-dimensional problem is a simple instance of
the representer theorem when the Hilbert-Schmidt norm is used as a penalty function
(Section 4.1), we prove in Section 4.2 a generalized representer theorem that is valid with any
spectral penalty function.



4.1 The case of the Hilbert-Schmidt penalty function 

In the particular case where the penalty function $\Omega(F)$ is the Hilbert-Schmidt norm, then
the set $\{F \in \mathcal{B}_0(\mathcal{Y}, \mathcal{X}) : \Omega(F) < \infty\}$ is the set of Hilbert-Schmidt operators. As recalled
in Appendix A, this set is a Hilbert space isometric through (1) to the reproducing kernel
Hilbert space $\mathcal{H}_\otimes$ of the kernel:

$$k_\otimes\left( (\mathbf{x}, \mathbf{y}), (\mathbf{x}', \mathbf{y}') \right) = \langle \mathbf{x}, \mathbf{x}' \rangle_{\mathcal{X}} \langle \mathbf{y}, \mathbf{y}' \rangle_{\mathcal{Y}} ,$$

and the isometry translates from $F$ to $f$ as:

$$\| f \|_{\mathcal{H}_\otimes}^2 = \| F \|_{\mathrm{Fro}}^2 = \Omega(F) .$$

As a result, in that case the problem (9) is equivalent to:

$$\min_{f \in \mathcal{H}_\otimes} \left\{ R_N(f) + \lambda \| f \|_{\mathcal{H}_\otimes}^2 \right\} . \qquad (12)$$

In that case the representer theorem for optimization of empirical risks penalized by the
RKHS norm (Aronszajn, 1950; Schölkopf et al., 2001) can be applied to show that the
solution of (12) necessarily lives in the linear span of the training data. With our notations
this translates into the following result:

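Theorem 2 If $\hat{F}$ is a solution of the problem:

$$\min_{F \in \mathcal{B}_2(\mathcal{Y}, \mathcal{X})} R_N(F) + \lambda \sum_{i=1}^{\infty} \sigma_i(F)^2 , \qquad (13)$$

then it is necessarily in the linear span of $\{\mathbf{x}_i \otimes \mathbf{y}_i : i = 1, \ldots, N\}$, i.e., it can be written
as:

$$\hat{F} = \sum_{i=1}^{N} \alpha_i \mathbf{x}_i \otimes \mathbf{y}_i , \qquad (14)$$

for some $\alpha \in \mathbb{R}^N$.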

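For the sake of completeness, and to highlight why this result is specific to the
Hilbert-Schmidt penalty function (6), we rephrase here, with our notations, the main arguments in
the proof of Schölkopf et al. (2001). Any operator $F$ in $\mathcal{B}_2(\mathcal{Y}, \mathcal{X})$ can be decomposed as $F =
F_S + F_\perp$, where $F_S$ is the projection of $F$ onto the linear span of $\{\mathbf{x}_i \otimes \mathbf{y}_i : i = 1, \ldots, N\}$.
Each prediction $\langle \mathbf{x}_i, F \mathbf{y}_i \rangle_{\mathcal{X}}$ is the Hilbert-Schmidt inner product of $F$ with $\mathbf{x}_i \otimes \mathbf{y}_i$, so,
$F_\perp$ being orthogonal to each $\mathbf{x}_i \otimes \mathbf{y}_i$ in the training set, one easily gets $R_N(F) = R_N(F_S)$,
while $\| F \|_{\mathrm{Fro}}^2 = \| F_S \|_{\mathrm{Fro}}^2 + \| F_\perp \|_{\mathrm{Fro}}^2$ by the Pythagorean theorem. As a result a minimizer $\hat{F}$
of the objective function must be such that $F_\perp = 0$, i.e., must be in the linear span of the
training tensor products.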
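
For the square loss, the expansion (14) gives a closed form: writing $K$ and $G$ for the Gram matrices of the users and objects on the training pairs, the coefficients $\alpha = (K \circ G + \lambda N I)^{-1} t$ solve the problem, where $\circ$ is the entrywise product. The sketch below is a minimal illustration under these assumptions (square loss, finite-dimensional synthetic data, our own function names); it compares the result with a direct ridge regression on the entries of $F$.

```python
import numpy as np

def fit_hilbert_schmidt(K, G, t, lam):
    """Coefficients alpha of F = sum_i alpha_i x_i (x) y_i for the square loss.

    K and G are the N x N Gram matrices of users and objects on the training
    pairs; their entrywise product is the Gram matrix of the product kernel."""
    N = len(t)
    return np.linalg.solve(K * G + lam * N * np.eye(N), t)

def predict(alpha, k_x, g_y):
    """f(x, y) = sum_i alpha_i <x, x_i> <y, y_i>."""
    return float(alpha @ (k_x * g_y))

rng = np.random.default_rng(0)
N, p, q, lam = 8, 3, 2, 0.1
A = rng.standard_normal((N, p))          # users x_i as rows (synthetic)
B = rng.standard_normal((N, q))          # objects y_i as rows (synthetic)
t = rng.standard_normal(N)

alpha = fit_hilbert_schmidt(A @ A.T, B @ B.T, t, lam)
F_from_alpha = A.T @ (alpha[:, None] * B)          # sum_i alpha_i x_i y_i^T

# Direct ridge regression on the p*q entries of F, for comparison.
Phi = np.einsum("ip,iq->ipq", A, B).reshape(N, p * q)
w = np.linalg.solve(Phi.T @ Phi + lam * N * np.eye(p * q), Phi.T @ t)
F_direct = w.reshape(p, q)
```

Only the Gram matrices enter the kernel form, so the same computation applies unchanged when the feature spaces are infinite dimensional.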

4.2 A Representer Theorem for General Spectral Penalty Functions



Let us now move on to the more general situation (9) where a general spectral function
$\Omega(F)$ is used as regularization. Theorem 2 is usually not valid in such a case. Its proof
breaks down because it is not true that $\Omega(F) = \Omega(F_S) + \Omega(F_\perp)$ for general $\Omega$, or even that

$$\Omega(F) \geq \Omega(F_S).$$

The following theorem, whose proof is presented in Appendix B, can be seen as a generalized
representer theorem. It shows that a solution of (9), if it exists, can be expanded over a
finite basis of dimension $m_\mathcal{X} \times m_\mathcal{Y}$ (where $m_\mathcal{X}$ and $m_\mathcal{Y}$ are the underlying dimensions of the
subspaces where the data lie), and that it can be found as the solution of a finite-dimensional
optimization problem (with no convexity assumptions on the loss):

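Theorem 3 For any spectral penalty function $\Omega : \mathcal{B}_0(\mathcal{Y},\mathcal{X}) \to \mathbb{R} \cup \{+\infty\}$, consider the
optimization problem:

$$\min_{F \in \mathcal{B}_0(\mathcal{Y},\mathcal{X})} R_N(F) + \lambda \Omega(F). \tag{15}$$

If the set of solutions is not empty, then there is a solution $F$ in $\mathcal{X}_N \otimes \mathcal{Y}_N$, i.e., there exists
$\alpha \in \mathbb{R}^{m_\mathcal{X} \times m_\mathcal{Y}}$ such that:

$$F = \sum_{i=1}^{m_\mathcal{X}} \sum_{j=1}^{m_\mathcal{Y}} \alpha_{ij} \, u_i \otimes v_j , \tag{16}$$

where $(u_1, \ldots, u_{m_\mathcal{X}})$ and $(v_1, \ldots, v_{m_\mathcal{Y}})$ form orthonormal bases of $\mathcal{X}_N$ and $\mathcal{Y}_N$, respectively.
Moreover, in that case the coefficients $\alpha$ can be found by solving the following
finite-dimensional optimization problem:

$$\min_{\alpha \in \mathbb{R}^{m_\mathcal{X} \times m_\mathcal{Y}}} R_N\left(\mathrm{diag}(X \alpha Y^\top)\right) + \lambda \Omega(\alpha) , \tag{17}$$

where $\Omega(\alpha)$ refers to the spectral penalty function applied to the matrix $\alpha$ seen as an operator
from $\mathbb{R}^{m_\mathcal{Y}}$ to $\mathbb{R}^{m_\mathcal{X}}$, and $X \in \mathbb{R}^{N \times m_\mathcal{X}}$ and $Y \in \mathbb{R}^{N \times m_\mathcal{Y}}$ denote any matrices that satisfy
$K = XX^\top$ and $G = YY^\top$ for the two $N \times N$ Gram matrices $K$ and $G$ defined by $K_{ij} = \langle x_i, x_j \rangle_\mathcal{X}$
and $G_{ij} = \langle y_i, y_j \rangle_\mathcal{Y}$, for $1 \leq i,j \leq N$.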
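
As an illustration of the quantities appearing in (17), the following sketch builds square roots $X$ and $Y$ of two Gram matrices by eigendecomposition and evaluates the prediction vector $\mathrm{diag}(X \alpha Y^\top)$ without forming the $N \times N$ product. The helper `gram_sqrt`, the random attribute matrices and the sizes are assumptions made for the example; any other square root (e.g., an incomplete Cholesky factor) would serve equally well.

```python
import numpy as np


def gram_sqrt(K, tol=1e-10):
    """Return X with K = X X^T, keeping only eigen-directions above tol (hypothetical helper)."""
    w, V = np.linalg.eigh(K)
    keep = w > tol * w.max()
    return V[:, keep] * np.sqrt(w[keep])


rng = np.random.default_rng(1)
N = 6
A = rng.standard_normal((N, 2))        # illustrative patterns x_i (rank 2)
B = rng.standard_normal((N, 3))        # illustrative patterns y_i (rank 3)
K, G = A @ A.T, B @ B.T                # N x N Gram matrices

X, Y = gram_sqrt(K), gram_sqrt(G)      # shapes (N, m_X) and (N, m_Y)
alpha = rng.standard_normal((X.shape[1], Y.shape[1]))

# diag(X alpha Y^T), computed row by row instead of through the full N x N matrix
pred = np.einsum("il,lm,im->i", X, alpha, Y)
```

The number of columns of `X` and `Y` recovers $m_\mathcal{X}$ and $m_\mathcal{Y}$, the ranks of the two Gram matrices.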

This theorem shows that, as soon as a spectral penalty function is used to control the
complexity of the compact operators, a solution can be searched for in the finite-dimensional
space $\mathcal{X}_N \otimes \mathcal{Y}_N$, which in practice boils down to an optimization problem over the set of
matrices of size $m_\mathcal{X} \times m_\mathcal{Y}$. The dimension of this space might however be prohibitively
large for real-world applications where, e.g., tens of thousands of users are confronted with
a database of thousands of objects. A convenient way to obtain an important decrease
in complexity (at the expense of possibly losing convexity) is by constraining the rank of
the operator through an adequate choice of a spectral penalty. Indeed, the set of non-zero
singular values of $F$ as an operator is equal to the set of non-zero singular values of
$\alpha$ in (16) seen as a matrix. Consequently any constraint on the rank of $F$ as an operator
results in a constraint on $\alpha$ as a matrix, from which we deduce:

Corollary 4 If, in Theorem 3, the spectral penalty function $\Omega$ is infinite on operators of
rank larger than $R$ (i.e., $s_{R+1}(u) = +\infty$ for $u > 0$), then the matrix $\alpha \in \mathbb{R}^{m_\mathcal{X} \times m_\mathcal{Y}}$
has rank at most $R$.

As a result, if a rank constraint $\mathrm{rank}(F) \leq r$ is added to the optimization problem then the
representer theorem still holds but the dimension of the parameter $\alpha$ becomes $r \times (m_\mathcal{X} + m_\mathcal{Y})$
instead of $m_\mathcal{X} \times m_\mathcal{Y}$, which is usually beneficial. We note, however, that when a rank
constraint is added to the Hilbert-Schmidt norm penalty, then the classical representer
Theorem 2 and the expansion of the solution over $N$ vectors (14) are not valid anymore;
only Theorem 3 and the expansion (16) can be used.
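
The link between the singular values of $F$ and those of $\alpha$, and the saving obtained from a rank constraint, can be illustrated as follows. This is a sketch with arbitrary, assumed dimensions; the orthonormal bases are random and the rank-$r$ matrix is stored through two hypothetical thin factors `A` and `B`.

```python
import numpy as np

rng = np.random.default_rng(2)
dX, dY, mX, mY, r = 50, 45, 40, 30, 2
U, _ = np.linalg.qr(rng.standard_normal((dX, mX)))   # orthonormal basis (u_1, ..., u_mX)
V, _ = np.linalg.qr(rng.standard_normal((dY, mY)))   # orthonormal basis (v_1, ..., v_mY)

# A rank-r coefficient matrix alpha, stored through two thin factors.
A, B = rng.standard_normal((mX, r)), rng.standard_normal((mY, r))
alpha = A @ B.T
F = U @ alpha @ V.T                                  # F = sum_ij alpha_ij u_i v_j^T

sv_F = np.linalg.svd(F, compute_uv=False)
sv_alpha = np.linalg.svd(alpha, compute_uv=False)

n_full = mX * mY                                     # free parameters without rank constraint
n_factored = r * (mX + mY)                           # free parameters in the factored form
```

With these sizes the factored representation uses 140 numbers instead of 1200, and the non-zero singular values of `F` and `alpha` coincide.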



5. Algorithms 

In this section we explain how the optimization problem (17) can be solved in practice.
We first consider a general formulation, then we specialize to the situation where many x's 
and many y's are identical; i.e., we are in a matrix completion setting where it may be 
advantageous to consider other formulations that take into account some group structure 
explicitly. 



5.1 Convex dual of spectral regularization 

When the loss is convex, we can derive the convex dual problem, which can be helpful for 
actually solving the optimization problem. This could also provide an alternative proof of 
the representer theorem in that particular situation. 

For all $i = 1, \ldots, N$, we let $\psi_i(v_i) = \ell(v_i, t_i)$ denote the loss corresponding to predicting $v_i$
for the $i$-th data point. For simplicity, we assume that each $\psi_i$ is convex (this is usually
met in practice). Following Bach et al. (2005), we let $\psi_i^*(\alpha_i)$ denote its Fenchel conjugate
defined as $\psi_i^*(\alpha_i) = \max_{v_i \in \mathbb{R}} \alpha_i v_i - \psi_i(v_i)$. Maximizers of the optimization problem defining
the conjugate function are often referred to as Fenchel duals to $\alpha_i$ (Boyd and Vandenberghe,
2003). In particular, we have the following classical examples:



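• Least-squares regression: we have $\psi_i(v_i) = \frac{1}{2}(v_i - t_i)^2$ and $\psi_i^*(\alpha_i) = \frac{1}{2}\alpha_i^2 + \alpha_i t_i$.

• Logistic regression: we have $\psi_i(v_i) = \log(1 + \exp(-t_i v_i))$, where $t_i \in \{-1, 1\}$, and
$\psi_i^*(\alpha_i) = (1 + \alpha_i t_i) \log(1 + \alpha_i t_i) - \alpha_i t_i \log(-\alpha_i t_i)$ if $\alpha_i t_i \in (-1, 0)$, $+\infty$ otherwise.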
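
The two closed forms above can be compared with a brute-force evaluation of the conjugate. The sketch below is only a sanity check: the grid, the target `t` and the dual value `a` are arbitrary choices made for the example, and the maximization is done by exhaustive search rather than by any method from the paper.

```python
import numpy as np

v = np.linspace(-40.0, 40.0, 800001)          # grid over candidate predictions v_i


def conj_numeric(psi, a):
    """Brute-force Fenchel conjugate: maximum over the grid of a*v - psi(v)."""
    return float(np.max(a * v - psi(v)))


t = 1.0      # illustrative target / label
a = -0.3     # illustrative dual variable, chosen so that a*t lies in (-1, 0)


def least_squares(u):
    return 0.5 * (u - t) ** 2


def logistic(u):
    return np.logaddexp(0.0, -t * u)          # log(1 + exp(-t u)), numerically stable


ls_closed = 0.5 * a ** 2 + a * t
lg_closed = (1 + a * t) * np.log(1 + a * t) - a * t * np.log(-a * t)

ls_num = conj_numeric(least_squares, a)
lg_num = conj_numeric(logistic, a)
```

On this grid the numerical maxima agree with the closed forms up to discretization error.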

We also assume that the spectral regularization is such that for all $i \in \mathbb{N}$, $s_i = s$, where $s$
is a convex function such that $s(0) = 0$. In this situation, we have $\Omega(A) = \sum_{i \in \mathbb{N}} s(\sigma_i(A))$.
We can also define a Fenchel conjugate for $\Omega(A)$, which is also a spectral function
$\Omega^*(B) = \sum_{i \in \mathbb{N}} s^*(\sigma_i(B))$ (Lewis and Sendov, 2002).



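Some special cases of interest for $s(\sigma)$ are:

• $s(\sigma) = |\sigma|$ leads to the trace norm and then $s^*(\tau) = 0$ if $|\tau| \leq 1$, and $+\infty$
otherwise.

• $s(\sigma) = \frac{1}{2}\sigma^2$ leads to the Frobenius/Hilbert-Schmidt norm and then $s^*(\tau) = \frac{1}{2}\tau^2$.

• $s(\sigma) = \varepsilon \log(1 + e^{\sigma/\varepsilon}) + \varepsilon \log(1 + e^{-\sigma/\varepsilon})$ is a smooth approximation of $|\sigma|$, which becomes
tighter when $\varepsilon$ is closer to zero. We have, up to an additive constant:
$s^*(\tau) = \varepsilon (1+\tau) \log(1+\tau) + \varepsilon (1-\tau) \log(1-\tau)$.
Moreover, $s'(\sigma) = \tau \Leftrightarrow (s^*)'(\tau) = \sigma = \varepsilon \log \frac{1+\tau}{1-\tau}$.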
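
The three penalties listed above are all obtained by applying a scalar function to the singular values and summing. A minimal sketch, with an assumed value of `eps` and a random test matrix, is:

```python
import numpy as np


def spectral_penalty(A, s):
    """Omega(A) = sum_i s(sigma_i(A)) for a scalar function s."""
    return float(np.sum(s(np.linalg.svd(A, compute_uv=False))))


eps = 1e-3                                   # illustrative smoothing parameter


def s_trace(x):
    return np.abs(x)


def s_frob(x):
    return 0.5 * x ** 2


def s_smooth(x):
    # eps*log(1+exp(x/eps)) + eps*log(1+exp(-x/eps)), written in a stable form
    return eps * (np.logaddexp(0.0, x / eps) + np.logaddexp(0.0, -x / eps))


rng = np.random.default_rng(3)
A = rng.standard_normal((5, 4))

omega_trace = spectral_penalty(A, s_trace)   # trace norm
omega_frob = spectral_penalty(A, s_frob)     # half the squared Frobenius norm
omega_smooth = spectral_penalty(A, s_smooth) # smooth surrogate of the trace norm
```

Since the smooth function exceeds $|\sigma|$ by at most $2\varepsilon \log 2$ per singular value, `omega_smooth` stays within $2 \varepsilon \log 2 \cdot \min(p, q)$ of the trace norm.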
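
Once the representer theorem has been applied, our optimization problem can be rewritten
in the primal form in (17):

$$\min_{\alpha \in \mathbb{R}^{m_\mathcal{X} \times m_\mathcal{Y}}} \sum_{i=1}^{N} \psi_i\left((X \alpha Y^\top)_{ii}\right) + \lambda \Omega(\alpha). \tag{18}$$

We can now form the Lagrangian, associated with added constraints $v = \mathrm{diag}(X \alpha Y^\top)$ and
corresponding Lagrange multiplier $\beta \in \mathbb{R}^N$:

$$\mathcal{L}(v, \alpha, \beta) = \sum_{i=1}^{N} \psi_i(v_i) - \sum_{i=1}^{N} \beta_i \left(v_i - (X \alpha Y^\top)_{ii}\right) + \lambda \Omega(\alpha),$$

and minimize with respect to $v$ and $\alpha$ to obtain the dual problem, which is to maximize:

$$-\sum_{i=1}^{N} \psi_i^*(\beta_i) - \lambda \Omega^*\left(-\frac{1}{\lambda} X^\top \mathrm{Diag}(\beta) Y\right). \tag{19}$$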

Once the optimal dual variable $\beta$ is found (there are as many of those as there are
observations), then we can go back to $\alpha$ (which may or may not be of smaller size), by Fenchel
duality, i.e., $\alpha$ is among the Fenchel duals of $-\frac{1}{\lambda} X^\top \mathrm{Diag}(\beta) Y$. Thus, when the function $s$ is
differentiable and strictly convex (which implies that the set of Fenchel duals is a singleton),
then we obtain the primal variables $\alpha$ in closed form from the dual variables $\beta$. When $s$ is
not differentiable, e.g., for the trace norm, then, following Amit et al. (2007), we can find
the primal variables by noting that once $\beta$ is known, the singular vectors of $\alpha$ are known
and we can find the singular values by solving a reduced convex optimization problem.
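
For a case where everything is available in closed form, the primal (18) and the dual (19) can be compared directly. The sketch below assumes the least-squares loss and $s(\sigma) = \frac{1}{2}\sigma^2$, so that $\Omega^*(B) = \frac{1}{2}\|B\|_F^2$ and the Fenchel dual of $B$ is $B$ itself; data and sizes are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
N, mX, mY, lam = 8, 3, 2, 0.5
X = rng.standard_normal((N, mX))
Y = rng.standard_normal((N, mY))
t = rng.standard_normal(N)

# Dual (19) with psi_i*(b) = b^2/2 + b t_i and Omega*(B) = ||B||_F^2 / 2:
#   maximize  -b.b/2 - b.t - b^T (K o G) b / (2 lam),  with K o G the Hadamard product.
KG = (X @ X.T) * (Y @ Y.T)
beta = -np.linalg.solve(np.eye(N) + KG / lam, t)
alpha_dual = -(X.T * beta) @ Y / lam        # alpha = -X^T Diag(beta) Y / lam

# Primal (18) solved directly as a ridge problem on vec(alpha).
Phi = np.einsum("il,im->ilm", X, Y).reshape(N, mX * mY)
alpha_primal = np.linalg.solve(
    Phi.T @ Phi + lam * np.eye(mX * mY), Phi.T @ t
).reshape(mX, mY)

pred = np.einsum("il,lm,im->i", X, alpha_primal, Y)
primal_val = 0.5 * np.sum((pred - t) ** 2) + 0.5 * lam * np.sum(alpha_primal ** 2)
dual_val = -0.5 * beta @ beta - beta @ t - beta @ KG @ beta / (2 * lam)
```

In this smooth, strictly convex setting the matrix recovered from $\beta$ coincides with the primal solution and the two objective values are equal.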



Computational complexity Note that for optimization, we have two strategies: using
the primal problem in Eq. (18) of dimension $m_\mathcal{X} m_\mathcal{Y} \leq n_\mathcal{X} n_\mathcal{Y}$ (the actual dimension of the
underlying data) or using the dual problem in Eq. (19) of dimension $N$ (the number of
ratings). The choice between those two formulations is problem dependent.



5.2 Collaborative filtering 



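In the presence of (many) identical columns and rows, which is often the case in collaborative
filtering situations, the kernel matrices $K$ and $L$ have some columns (and thus rows)
which are identical, and we can instead consider the kernel matrices (with their square-root
decompositions) $K = XX^\top$ and $L = YY^\top$ as the kernel matrices for all distinct elements
of $\mathcal{X}$ and $\mathcal{Y}$ (let $n_\mathcal{X}$ and $n_\mathcal{Y}$ be their sizes). Then each observation $(x_i, y_i, t_i)$ corresponds
to a pair of indices $(a(i), b(i))$ in $\{1, \ldots, n_\mathcal{X}\} \times \{1, \ldots, n_\mathcal{Y}\}$, and the primal/dual problems
become:

$$\min_{\alpha \in \mathbb{R}^{m_\mathcal{X} \times m_\mathcal{Y}}} \sum_{i=1}^{N} \psi_i\left(\delta_{a(i)}^\top X \alpha Y^\top \delta_{b(i)}\right) + \lambda \Omega(\alpha), \tag{20}$$

where $\delta_u$ is a vector whose entries are all zero except for a one at position $u$. The dual function is

$$-\sum_{i=1}^{N} \psi_i^*(\beta_i) - \lambda \Omega^*\left(-\frac{1}{\lambda} X^\top \left(\sum_{i=1}^{N} \beta_i \, \delta_{a(i)} \delta_{b(i)}^\top\right) Y\right).$$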
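
In code, the indicator vectors $\delta_{a(i)}$ and $\delta_{b(i)}$ amount to row indexing. The following sketch (random factors, hypothetical sizes, 0-based indices) evaluates the predictions in (20) and accumulates the sparse matrix $\sum_i \beta_i \delta_{a(i)} \delta_{b(i)}^\top$ that enters the dual function.

```python
import numpy as np

rng = np.random.default_rng(5)
nX, nY, mX, mY, N = 4, 5, 3, 2, 7
X = rng.standard_normal((nX, mX))      # one row per distinct user
Y = rng.standard_normal((nY, mY))      # one row per distinct object
a = rng.integers(0, nX, size=N)        # a(i): user index of observation i
b = rng.integers(0, nY, size=N)        # b(i): object index of observation i
alpha = rng.standard_normal((mX, mY))
beta = rng.standard_normal(N)

# Primal predictions delta_{a(i)}^T X alpha Y^T delta_{b(i)}, one per observation.
pred = np.einsum("il,lm,im->i", X[a], alpha, Y[b])

# Dual argument: sum_i beta_i delta_{a(i)} delta_{b(i)}^T as an nX x nY matrix.
S = np.zeros((nX, nY))
np.add.at(S, (a, b), beta)             # also correct for repeated (user, object) pairs
dual_arg = X.T @ S @ Y
```

Only the $n_\mathcal{X} + n_\mathcal{Y}$ distinct rows are stored, whatever the number $N$ of ratings.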

As with usual kernel machines and the general case presented above, using the primal
or the dual formulation for optimization depends on the number of available ratings $N$
compared to the ranks $m_\mathcal{X}$ and $m_\mathcal{Y}$ of the kernel matrices $K$ and $L$. Indeed, the number of
variables in the primal formulation is $m_\mathcal{X} m_\mathcal{Y}$, while in the dual formulation it is $N$.

5.3 Low-rank constrained problem 

We approximate the spectral norm by an infinitely differentiable spectral function. Since
we consider in this paper only infinitely differentiable loss functions, our problem is that of
minimizing an infinitely differentiable convex function $G(W)$ over rectangular matrices of
size $p \times q$ for certain integers $p$ and $q$. As a result of our spectral regularization, we hope to
obtain (approximately) low-rank matrices. In this context, it has proved advantageous to
consider low-rank decompositions of the form $W = UV^\top$ where $U$ and $V$ have $m \leq \min\{p, q\}$
columns (Burer and Monteiro, 2005; Burer and Choi, 2006). Burer and Monteiro (2005)
have shown that if $m = \min\{p, q\}$ then the non-convex problem of minimizing $G(UV^\top)$
with respect to $U$ and $V$ has no local minima.

We now prove a stronger result in the context of twice differentiable functions, namely that
if the global optimum of $G$ has rank $r < \min\{p, q\}$, then the low-rank constrained problem
with rank $r + 1$ has no local minimum and its global minimum corresponds to the global
minimum of $G$. The following proposition makes this precise (see Appendix C for proof).

Proposition 5 Let $G$ be a twice differentiable convex function on matrices of size $p \times q$ with
compact level sets. Let $m > 1$ and $(U, V) \in \mathbb{R}^{p \times m} \times \mathbb{R}^{q \times m}$ a local optimum of the function
$H : \mathbb{R}^{p \times m} \times \mathbb{R}^{q \times m} \to \mathbb{R}$ defined by $H(U, V) = G(UV^\top)$, i.e., $(U, V)$ is such that $\nabla H(U, V) = 0$
and the Hessian of $H$ at $(U, V)$ is positive semi-definite. If $U$ or $V$ is rank deficient, then
$N = UV^\top$ is a global minimum of $G$, i.e., $\nabla G(N) = 0$.

The previous proposition shows that if we have a local minimum for the rank-$m$ problem
and if the solution is rank deficient, then we have a solution of the global optimization
problem. This naturally leads to a sequence of reduced problems of increasing dimension
$m$, smaller than $r + 1$, where $r$ is the rank of the global optimum. However, the number
of iterations of each of the local minimizations and the final rank $m$ cannot be bounded a
priori in general.
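
The resulting strategy (solve the factored problem for $m = 1, 2, \ldots$ and stop as soon as a factor is rank deficient) can be sketched on a toy objective. Everything below is an assumption made for illustration: $G(W) = \frac{1}{2}\|W - M\|_F^2$ with a rank-2 matrix `M`, and alternating exact minimization as the local solver, which is not necessarily the solver used for the experiments reported here.

```python
import numpy as np

rng = np.random.default_rng(6)
p, q, r = 6, 5, 2
Q1, _ = np.linalg.qr(rng.standard_normal((p, r)))
Q2, _ = np.linalg.qr(rng.standard_normal((q, r)))
M = Q1 @ np.diag([3.0, 2.0]) @ Q2.T            # global optimum of G, of rank r = 2


def grad_G(W):
    """Gradient of the toy objective G(W) = ||W - M||_F^2 / 2."""
    return W - M


def local_solve(m, iters=200, tol=1e-10):
    """Alternating exact minimization of H(U, V) = G(U V^T) with m columns."""
    V = rng.standard_normal((q, m))
    for _ in range(iters):
        U = M @ np.linalg.pinv(V.T, tol)        # best U for fixed V
        V = M.T @ np.linalg.pinv(U.T, tol)      # best V for fixed U
    return U, V


for m in range(1, min(p, q) + 1):
    U, V = local_solve(m)
    if np.linalg.matrix_rank(U, tol=1e-8) < m:  # rank-deficient factor: certified, stop
        break

final_m = m
final_grad_norm = float(np.linalg.norm(grad_G(U @ V.T)))
```

In this toy run the loop stops at $m = r + 1$, where the factor is rank deficient and the gradient of $G$ vanishes; at $m = r$ the optimum is already reached but is not certified by rank deficiency.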

Note that using a low-rank representation to solve the trace-norm regularized problem leads
to a non-convex minimization problem with no local minima, while simply using the
low-rank representation without the trace norm penalty and potentially with a Frobenius norm
penalty, may lead to local minima; i.e., we consider instead of Eq. (17) with the trace norm,
the following formulation:

$$\min_{\alpha, \beta} \ R_N\left(\mathrm{diag}(X \alpha \beta^\top Y^\top)\right) + \lambda \sum_{k} \|\alpha(:,k)\|^2 \, \|\beta(:,k)\|^2 , \tag{21}$$

where $\alpha(:,k)$ and $\beta(:,k)$ are the $k$-th columns of $\alpha$ and $\beta$. In the simulation section, we
compare the two approaches on a synthetic example, and show that the convex formulation
solved through a sequence of non-convex formulations leads to better predictive performance.



5.4 Kernel learning for spectral functions 

In our collaborative filtering context, there are two potentially useful sources of kernel
learning: learning the attribute kernels, or learning the weights $\eta$ and $\zeta$ between Dirac kernels and
attribute kernels. In this section, we show how multiple kernel learning (MKL) (Lanckriet et al.,
2004; Bach et al., 2004) may be extended to spectral regularization.

We first show that the optimization problem that we have defined in earlier sections only
depends on the Kronecker product of kernel matrices $K \otimes G$:



Proposition 6 The dual solution of the optimization problem in Eq. (22) depends only on
the matrix $K \otimes G$.

Proof It suffices to show that for all matrices $B$, the positive singular values of $X^\top B Y$
only depend on $K \otimes G$. The largest singular value is defined as the maximum of $a^\top X^\top B Y b$
over unit norm vectors $a$ and $b$. By a change of variable, it is equivalent to maximize

$$\frac{(X^\top \tilde{a})^\top X^\top B Y (Y^\top \tilde{b})}{\|X^\top \tilde{a}\| \, \|Y^\top \tilde{b}\|} = \frac{\mathrm{vec}(\tilde{b}\tilde{a}^\top)^\top (K \otimes G) \, \mathrm{vec}(B)}{\left(\mathrm{vec}(\tilde{b}\tilde{a}^\top)^\top (K \otimes G) \, \mathrm{vec}(\tilde{b}\tilde{a}^\top)\right)^{1/2}}$$

(Golub and Loan). Thus the largest positive singular value is indeed a function of $K \otimes G$. Results for other
singular values may be obtained similarly. ∎
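
A quick numerical check of this invariance is sketched below, under assumed random data. Two changes leave $K \otimes G$ untouched: replacing the square roots by other square roots of the same Gram matrices (`X2`, `Y2`), and rescaling $K$ and $G$ in opposite directions (`X3`, `Y3`). In both cases the singular values of $X^\top B Y$ should not move.

```python
import numpy as np

rng = np.random.default_rng(7)
N, mX, mY = 6, 3, 2
X = rng.standard_normal((N, mX))
Y = rng.standard_normal((N, mY))
B = np.diag(rng.standard_normal(N))            # e.g. B = Diag(beta) as in the dual

# Other square roots of the same Gram matrices: X Qx and Y Qy with orthogonal Qx, Qy.
Qx, _ = np.linalg.qr(rng.standard_normal((mX, mX)))
Qy, _ = np.linalg.qr(rng.standard_normal((mY, mY)))
X2, Y2 = X @ Qx, Y @ Qy

# Different (K, G) with the same Kronecker product: (4 K) kron (G / 4) = K kron G.
X3, Y3 = 2.0 * X, Y / 2.0

sv1 = np.linalg.svd(X.T @ B @ Y, compute_uv=False)
sv2 = np.linalg.svd(X2.T @ B @ Y2, compute_uv=False)
sv3 = np.linalg.svd(X3.T @ B @ Y3, compute_uv=False)

kron1 = np.kron(X @ X.T, Y @ Y.T)
kron3 = np.kron(X3 @ X3.T, Y3 @ Y3.T)
```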



This shows that the natural kernel matrix to be learned in our context is the Kronecker
product $K \otimes G$. We thus follow Lanckriet et al. (2004) and consider $M$ kernel matrices
$K_1, \ldots, K_M$ for $\mathcal{X}$ and $M$ kernel matrices $G_1, \ldots, G_M$ for $\mathcal{Y}$; one possibility could be to
learn a convex combination of the matrices $K_k \otimes G_k$ by minimizing with respect to the
combination weights the optimal value of the problem in Eq. (22). However, unlike the
problem in [...] general. We thus focus on the alternative formulation of the MKL problem
(Bach et al., 2004): we consider the sum of the predictor functions associated with each of the
individual kernel pairs $(K_k, G_k)$ and penalize by the sum of the norms.

That is, if we let $X_1, \ldots, X_M$ and $Y_1, \ldots, Y_M$ denote the respective square roots of matrices
$K_1, \ldots, K_M$ and $G_1, \ldots, G_M$, we look for predictor functions which are sums of the $M$
possible atomic predictor functions, and we penalize by the sum of spectral functions, to
obtain the following optimization problem:

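$$\min_{\forall k, \ \alpha_k} \ \sum_{i=1}^{N} \psi_i\left(\sum_{k=1}^{M} (X_k \alpha_k Y_k^\top)_{ii}\right) + \lambda \sum_{k=1}^{M} \Omega(\alpha_k).$$

We form the Lagrangian:

$$\mathcal{L}(v, \alpha_1, \ldots, \alpha_M, \beta) = \sum_{i=1}^{N} \psi_i(v_i) - \sum_{i=1}^{N} \beta_i \left(v_i - \sum_{k=1}^{M} (X_k \alpha_k Y_k^\top)_{ii}\right) + \lambda \sum_{k=1}^{M} \Omega(\alpha_k),$$

and minimize w.r.t. $v$ and $\alpha_1, \ldots, \alpha_M$ to obtain the dual problem, which is to maximize

$$-\sum_{i=1}^{N} \psi_i^*(\beta_i) - \lambda \sum_{k=1}^{M} \Omega^*\left(-\frac{1}{\lambda} X_k^\top \mathrm{Diag}(\beta) Y_k\right). \tag{22}$$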



In the case of the trace norm, we obtain support kernels (Bach et al., 2004), i.e., only a
sparse combination of matrices ends up being used. Note that in the dual formulation, there
is only one $\beta$ to optimize, and thus it is preferable to use the dual formulation rather than
the primal formulation.



This framework can be naturally applied to combine the four corners defined in Section 3.4.
Indeed, we can form $M = 4$ kernel matrices for each of the four corners and learn a
combination of such matrices. We show in Section 6 how the MKL framework allows us to
automatically combine these four corners without setting the trade-off directly through $\eta$
and $\zeta$ (by the user or through cross-validation).



6. Experiments 

In this section we present several experimental findings for the algorithms and methods
discussed above. Much of the present work was motivated by the problem of collaborative
filtering and we therefore focus solely on this domain. As discussed in Section 3, by
using operator estimation and spectral regularization as a framework for CF, we may utilize 
potentially more information to predict preferences. Our primary goal now is to show that, 
as one would hope, such capabilities do improve prediction accuracy. 



6.1 Datasets and Metrics 

We present several plots created by experimenting on synthetic data. This dataset was 
generated as follows: (1) sample i.i.d. multivariate features for x of dimension 6, (2) generate 
i.i.d. multivariate features for y of dimension 6 as well, (3) sample z from a random bilinear 
form in x and y plus some noise, (4) restrict the observed feature space to only 3 features 
for both x and y. Since part of the data is discarded, the label cannot be perfectly predicted
by the known features. On the other hand, since we keep some of them, knowing and using
these attributes should work better than not using them. In other words, we expect that
setting $\eta$ and $\zeta$ to values other than 0 or 1 should provide better performance.
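
Steps (1) to (4) can be sketched as follows. The function name `make_synthetic`, the Gaussian distributions, the sample size and the noise level are all assumptions made for the example; the text only fixes the dimensions (6 latent features, 3 observed) and the bilinear form of the label.

```python
import numpy as np


def make_synthetic(n=200, d=6, d_obs=3, noise=0.1, seed=0):
    """Latent features, bilinear labels with noise, truncated observed features."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))              # (1) features for x, dimension 6
    y = rng.standard_normal((n, d))              # (2) features for y, dimension 6
    W = rng.standard_normal((d, d))              # random bilinear form (assumed Gaussian)
    z = np.einsum("ip,pq,iq->i", x, W, y)        # (3) z_i = x_i^T W y_i ...
    z = z + noise * rng.standard_normal(n)       #     ... plus noise
    return x[:, :d_obs], y[:, :d_obs], z         # (4) keep only 3 features for each side


x_obs, y_obs, z = make_synthetic()
```

Because half of the features are discarded in step (4), the observed attributes explain only part of `z`, which is the situation described above.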

We also experimented with the well-known MovieLens 100k dataset from the GroupLens 
Research Group at the University of Minnesota. This dataset consists of ratings of 1682 
movies by 943 users. Each user provided a rating, in the form of a score from {1, 2, 3, 4, 5}, 
for a small subset of the movies. Each user rated at least 20 movies, and the total number of 
ratings available is exactly 100,000, averaging about 106 per user. This dataset was rather
appropriate as it included attribute information for both the movies and the users. Each 
movie was labeled with at least one among 19 genres (e.g., action or adventure), while the 
users' attributes included age, gender, and an occupation among a list of 21 occupations 
(e.g., administrator or artist). We converted the users' age attribute to a set of binary 
features that describes to which of 5 age categories the user belongs. 

Figure 1: Comparison between two spectral penalties: the trace norm (left) and the Frobenius
norm (right), each with an additional fixed rank constraint as described in
Section 5.3. Each surface plot displays performance values over a range of $\eta$ and
$\zeta$ values, all obtained using the synthetic dataset. The minimal value achieved by
the trace norm is 0.1222 and the one achieved by the rank constraint is 0.1540.

All test set accuracies are measured as the root mean squared error averaged over 10-fold
cross validations. In particular, we focus on the comparisons of intermediate values of $\eta$
and $\zeta$ compared to the four "corners" of the $\eta/\zeta$ parameter space:

• $\eta = 0$, $\zeta = 0$: matrix completion

• $\eta = 0$, $\zeta = 1$ and $\eta = 1$, $\zeta = 0$: multi-task learning on users or objects

• $\eta = 1$, $\zeta = 1$: pairwise learning
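
The four corners and the intermediate settings can be sketched in code under one explicit assumption: that each kernel is the convex combination $w \cdot (\text{attribute kernel}) + (1 - w) \cdot (\text{Dirac kernel})$, with $w = \eta$ for users and $w = \zeta$ for objects. This form is assumed here for illustration (the definition itself is given in Section 3.4), as are the helper name `interpolated_kernel` and the random attribute matrices.

```python
import numpy as np


def interpolated_kernel(K_attr, w):
    """w * attribute kernel + (1 - w) * Dirac (identity) kernel; assumed form."""
    return w * K_attr + (1.0 - w) * np.eye(K_attr.shape[0])


rng = np.random.default_rng(8)
A_users = rng.standard_normal((4, 3))          # hypothetical user attributes
A_objs = rng.standard_normal((5, 2))           # hypothetical object attributes
K_attr, G_attr = A_users @ A_users.T, A_objs @ A_objs.T

# The four corners of the (eta, zeta) square, as pairs (user kernel, object kernel).
corners = {
    (eta, zeta): (interpolated_kernel(K_attr, eta), interpolated_kernel(G_attr, zeta))
    for eta in (0.0, 1.0)
    for zeta in (0.0, 1.0)
}

# An interior point of the square, mixing attribute and Dirac kernels.
K_mid, G_mid = interpolated_kernel(K_attr, 0.5), interpolated_kernel(G_attr, 0.5)
```

Under this assumption, the corner $(0, 0)$ uses identity kernels only (matrix completion), and the corner $(1, 1)$ uses attribute kernels only (pairwise learning).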

6.2 Results 

Trace Norm Versus Low-Rank In Figure 1 we present two performance plots over the
$\eta/\zeta$ parameter space, both obtained using the synthetic dataset. The left plot displays
the results when utilizing the trace norm spectral penalty. Here we used the low rank
decomposition formulation described in Section 5.3, which (by Proposition 5) has no local
minima. The plot on the right utilizes the same rank-constrained formulation, but with a
Frobenius norm penalty instead. The trace norm constrained algorithm performs slightly
better. Moreover, best predictive performance is achieved in both cases in the middle of
the square and not at any of the four corners.

Kernel Learning In Figure 2 we show the test set accuracy as a function of the
regularization parameter, when we use the kernels corresponding to the four corners as the
four basis kernels. We recover performance similar to that obtained by searching over all
$\eta$ and $\zeta$ (error of 0.14 instead of 0.12). The same algorithm could also be used to learn
kernels on the attributes.

Figure 2: Learning the kernel: test set accuracy vs. regularization parameter. Minimum
value is 0.14.

Performance on MovieLens Data Figure 3 shows the predictive accuracy in RMSE on
the MovieLens dataset, obtained by 10-fold cross-validation. The heat plot provides some
insight into the relative value, for both movies and users, of the given attribute kernels versus
the simple identity kernels. The corners have higher values than some of the values inside
the square, showing that the best balance between attribute and Dirac kernels is achieved
for $\eta, \zeta \in (0, 1)$.

7. Conclusions 

We have presented a method for solving a generalized matrix completion problem where we 
have attributes describing the matrix dimensions. The problem is formalized as the problem 
of inferring a linear compact operator between two general Hilbert spaces, which generalizes 
the classical finite-dimensional matrix completion problem. We introduced the notion of 
spectral regularization for operators, which generalized various spectral penalizations for 
matrices, and proved a general representer theorem for this setting. Various approaches, 
such as standard low rank matrix completion, are special cases of our method. It is partic- 
ularly relevant for CF applications where attributes are available for users and/or objects, 
and preliminary experiments confirm the benefits of our method. 

An interesting direction of future research is to explore further the multi-task learning 
algorithm we obtained with low-rank constraint, and to study the possibility to derive 
on-line implementations that may better fit the need for large-scale applications where 
training data are continuously increasing. On the theoretical side, a better understanding 
of the effects of norm and rank regularizations and their interaction would be of considerable 
interest. 

Figure 3: A heat plot of performance for a range of kernel parameter choices, $\eta$ and $\zeta$, using
the MovieLens dataset.



Appendix A. Compact operators on Hilbert spaces 

In this appendix, we recall basic definitions and properties of Hilbert space operators. We
refer the interested reader to general books (Brezis, 1980; Berlinet and Thomas-Agnan,
2003) for more details.

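Let $\mathcal{X}$ and $\mathcal{Y}$ be two Hilbert spaces, with respective inner products denoted by $\langle x, x' \rangle_\mathcal{X}$ and
$\langle y, y' \rangle_\mathcal{Y}$ for $x, x' \in \mathcal{X}$ and $y, y' \in \mathcal{Y}$. We denote by $\mathcal{B}(\mathcal{Y},\mathcal{X})$ the set of bounded operators
from $\mathcal{Y}$ to $\mathcal{X}$, i.e., of continuous linear mappings from $\mathcal{Y}$ to $\mathcal{X}$. For any two elements $(x, y)$
in $\mathcal{X} \times \mathcal{Y}$, we denote by $x \otimes y$ their tensor product, i.e., the linear operator from $\mathcal{Y}$ to $\mathcal{X}$
defined by:

$$\forall h \in \mathcal{Y}, \quad (x \otimes y) h = \langle y, h \rangle_\mathcal{Y} \, x . \tag{23}$$

We denote by $\mathcal{B}_0(\mathcal{Y},\mathcal{X})$ the set of compact linear operators from $\mathcal{Y}$ to $\mathcal{X}$, i.e., the set of
linear operators that map the unit ball of $\mathcal{Y}$ to a relatively compact set of $\mathcal{X}$. Alternatively,
they can also be defined as the limits of finite rank operators.

When $\mathcal{X}$ and $\mathcal{Y}$ have finite dimensions, then $\mathcal{B}_0(\mathcal{Y},\mathcal{X})$ is simply the set of linear mappings
from $\mathcal{Y}$ to $\mathcal{X}$, which can be represented by the set of matrices of dimensions $\dim(\mathcal{X}) \times
\dim(\mathcal{Y})$. In that case the tensor product $x \otimes y$ is represented by the matrix $xy^\top$, where
$y^\top$ denotes the transpose of $y$.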
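
In finite dimensions, definition (23) and its matrix representation can be checked in a few lines. The vectors below are random and the dimensions are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.standard_normal(4)          # element of X = R^4
y = rng.standard_normal(3)          # element of Y = R^3
h = rng.standard_normal(3)          # test vector in Y

T = np.outer(x, y)                  # matrix x y^T of size dim(X) x dim(Y)
lhs = T @ h                         # (x (x) y) h
rhs = (y @ h) * x                   # <y, h>_Y x, as in (23)
```

The outer product `T` is a rank-one matrix, and applying it to `h` gives the same vector as the right-hand side of (23).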

For general Hilbert spaces $\mathcal{X}$ and $\mathcal{Y}$, any compact linear operator $F \in \mathcal{B}_0(\mathcal{Y},\mathcal{X})$ admits a
spectral decomposition:

$$F = \sum_{i=1}^{\infty} \sigma_i \, u_i \otimes v_i . \tag{24}$$

Here the singular values $(\sigma_i)_{i \in \mathbb{N}}$ form a sequence of non-negative real numbers such
that $\lim_{i \to \infty} \sigma_i = 0$, and $(u_i)_{i \in \mathbb{N}}$ and $(v_i)_{i \in \mathbb{N}}$ form orthonormal families in $\mathcal{X}$ and $\mathcal{Y}$,
respectively. Although the vectors $(u_i)_{i \in \mathbb{N}}$ and $(v_i)_{i \in \mathbb{N}}$ in (24) are not uniquely defined for a given
operator $F$, the set of singular values is uniquely defined. By convention we denote by
$\sigma_1(F), \sigma_2(F), \ldots$, the successive singular values of $F$ ranked by decreasing order. The rank
of $F$ is the number $\mathrm{rank}(F) \in \mathbb{N} \cup \{+\infty\}$ of strictly positive singular values.

We now describe three subclasses of compact operators of particular relevance in the rest 
of this paper. 

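• The set of operators with finite rank is denoted $\mathcal{B}_F(\mathcal{Y},\mathcal{X})$.

• The operators $F \in \mathcal{B}_0(\mathcal{Y},\mathcal{X})$ that satisfy:

$$\sum_{i=1}^{\infty} \sigma_i(F)^2 < \infty$$

are called Hilbert-Schmidt operators. They form a Hilbert space, denoted $\mathcal{B}_2(\mathcal{Y},\mathcal{X})$,
with inner product $\langle \cdot, \cdot \rangle_{\mathcal{X} \otimes \mathcal{Y}}$ between basic tensor products given by:

$$\langle x \otimes y, x' \otimes y' \rangle_{\mathcal{X} \otimes \mathcal{Y}} = \langle x, x' \rangle_\mathcal{X} \, \langle y, y' \rangle_\mathcal{Y} . \tag{25}$$

In particular, the Hilbert-Schmidt norm of an operator in $\mathcal{B}_2(\mathcal{Y},\mathcal{X})$ is given by:

$$\|F\|_2 = \sqrt{\sum_{i=1}^{\infty} \sigma_i(F)^2} .$$

Another useful characterization of Hilbert-Schmidt operators is the following. Each
linear operator $F : \mathcal{Y} \to \mathcal{X}$ uniquely defines a bilinear function $f_F : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ by

$$f_F(x, y) = \langle x, F y \rangle_\mathcal{X} .$$

The set of functions $f_F$ associated to the Hilbert-Schmidt operators forms itself a
Hilbert space of functions $\mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, which is the reproducing kernel Hilbert space
of the product kernel defined for $((x, y), (x', y')) \in (\mathcal{X} \times \mathcal{Y})^2$ by

$$k_\otimes\left((x, y), (x', y')\right) = \langle x, x' \rangle_\mathcal{X} \, \langle y, y' \rangle_\mathcal{Y} .$$

• The operators $F \in \mathcal{B}_0(\mathcal{Y},\mathcal{X})$ that satisfy:

$$\sum_{i=1}^{\infty} \sigma_i(F) < \infty$$

are called trace-class operators. The set of trace-class operators is denoted $\mathcal{B}_1(\mathcal{Y},\mathcal{X})$.
The trace norm of an operator $F \in \mathcal{B}_1(\mathcal{Y},\mathcal{X})$ is given by:

$$\|F\|_1 = \sum_{i=1}^{\infty} \sigma_i(F) .$$

Obviously the following ordering exists among these various classes of operators:

$$\mathcal{B}_F(\mathcal{Y},\mathcal{X}) \subset \mathcal{B}_1(\mathcal{Y},\mathcal{X}) \subset \mathcal{B}_2(\mathcal{Y},\mathcal{X}) \subset \mathcal{B}_0(\mathcal{Y},\mathcal{X}) \subset \mathcal{B}(\mathcal{Y},\mathcal{X}) ,$$

and all inclusions are equalities if $\mathcal{X}$ and $\mathcal{Y}$ have finite dimensions.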

Appendix B. Proof of Theorem 3

We start with a general result about the decrease of singular values for compact operators
composed with projection:

Lemma 7 Let $\mathcal{G}$ and $\mathcal{H}$ be two Hilbert spaces, $H$ a closed linear subspace of $\mathcal{H}$, and $\Pi_H$
denote the orthogonal projection onto $H$. Then for any compact operator $F : \mathcal{G} \to \mathcal{H}$ it
holds that:

$$\forall i \geq 1, \quad \sigma_i(\Pi_H F) \leq \sigma_i(F).$$

Proof We use the classical characterization of the $i$-th singular value:

$$\sigma_i(F) = \max_{V \in \mathcal{V}_i(\mathcal{G})} \ \min_{x \in V, \ \|x\|_\mathcal{G} = 1} \|F x\|_\mathcal{H} ,$$

where $\mathcal{V}_i(\mathcal{G})$ denotes the set of all linear subspaces of $\mathcal{G}$ of dimension $i$. Now, observing that
for any $x$ we have $\|\Pi_H F x\|_\mathcal{H} \leq \|F x\|_\mathcal{H}$ proves the Lemma. ∎
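
Lemma 7 is easy to observe numerically in finite dimensions. The sketch below uses a random matrix and a random 3-dimensional subspace, both chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(10)
F = rng.standard_normal((7, 5))                    # operator from R^5 to R^7
Q, _ = np.linalg.qr(rng.standard_normal((7, 3)))   # orthonormal basis of a subspace H
Pi = Q @ Q.T                                       # orthogonal projection onto H

sv_F = np.linalg.svd(F, compute_uv=False)          # sigma_i(F), in decreasing order
sv_PF = np.linalg.svd(Pi @ F, compute_uv=False)    # sigma_i(Pi_H F), in decreasing order
```

Each singular value of the projected operator is bounded by the corresponding singular value of `F`, and only three of them can be non-zero since the subspace has dimension 3.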

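Given a training set of patterns $(x_i, y_i)_{i=1,\ldots,N}$ in $\mathcal{X} \times \mathcal{Y}$, remember that we denote by $\mathcal{X}_N$
and $\mathcal{Y}_N$ the linear subspaces of $\mathcal{X}$ and $\mathcal{Y}$ spanned by the training patterns $\{x_i, i = 1, \ldots, N\}$
and $\{y_i, i = 1, \ldots, N\}$, respectively. For any operator $F \in \mathcal{B}_0(\mathcal{Y},\mathcal{X})$, let us now consider
the operator $G = \Pi_{\mathcal{X}_N} F \Pi_{\mathcal{Y}_N}$. By construction, $F$ and $G$ agree on the training patterns, in
the sense that for $i = 1, \ldots, N$:

$$\langle x_i, G y_i \rangle_\mathcal{X} = \langle x_i, \Pi_{\mathcal{X}_N} F \Pi_{\mathcal{Y}_N} y_i \rangle_\mathcal{X} = \langle \Pi_{\mathcal{X}_N} x_i, F \Pi_{\mathcal{Y}_N} y_i \rangle_\mathcal{X} = \langle x_i, F y_i \rangle_\mathcal{X} .$$

Therefore $F$ and $G$ have the same empirical risk:

$$R_N(F) = R_N(G) . \tag{26}$$

Now, by denoting $F^*$ the adjoint operator, we can use Lemma 7 and the fact that the
singular values of an operator and its adjoint are the same to obtain, for any $i \geq 1$:

$$\sigma_i(G) = \sigma_i(\Pi_{\mathcal{X}_N} F \Pi_{\mathcal{Y}_N}) \leq \sigma_i(F \Pi_{\mathcal{Y}_N}) = \sigma_i(\Pi_{\mathcal{Y}_N} F^*) \leq \sigma_i(F^*) = \sigma_i(F).$$

This implies that the spectral penalty term satisfies $\Omega(G) \leq \Omega(F)$. Combined with (26),
this shows that if $F$ is a solution to (15), then $G = \Pi_{\mathcal{X}_N} F \Pi_{\mathcal{Y}_N}$ is also a solution. Observing
that $G \in \mathcal{X}_N \otimes \mathcal{Y}_N$ concludes the proof of the first part of Theorem 3, resulting in (16).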

We have now reduced the optimization problem in $\mathcal{B}_0(\mathcal{Y},\mathcal{X})$ to a finite-dimensional
optimization over the matrix $\alpha$ of size $m_\mathcal{X} \times m_\mathcal{Y}$. Let us now rephrase the optimization problem
in this finite-dimensional space.

Let us first consider the spectral penalty term $\Omega(F)$. Given the decomposition (16), the
non-zero singular values of $F$ as an operator are exactly the non-zero singular values of $\alpha$ as
a matrix, as soon as $(u_1, \ldots, u_{m_\mathcal{X}})$ and $(v_1, \ldots, v_{m_\mathcal{Y}})$ form orthonormal bases of $\mathcal{X}_N$ and
$\mathcal{Y}_N$, respectively. In order to be able to express the empirical risk $R_N(F)$ we must however
consider a decomposition of $F$ over the training patterns, as:

$$F = \sum_{i=1}^{N} \sum_{j=1}^{N} \gamma_{ij} \, x_i \otimes y_j . \tag{27}$$

In order to express the singular values from this expression let us introduce the Gram
matrices $K$ and $G$ of the training patterns, i.e., the $N \times N$ matrices defined for $i, j = 1, \ldots, N$
by:

$$K_{ij} = \langle x_i, x_j \rangle_\mathcal{X} , \qquad G_{ij} = \langle y_i, y_j \rangle_\mathcal{Y} .$$

We note that by definition the ranks of $K$ and $G$ are respectively $m_\mathcal{X}$ and $m_\mathcal{Y}$. Let us
now factorize these two matrices as $K = XX^\top$ and $G = YY^\top$, where $X \in \mathbb{R}^{N \times m_\mathcal{X}}$ and
$Y \in \mathbb{R}^{N \times m_\mathcal{Y}}$ are any square roots, e.g., obtained by kernel PCA or Cholesky decomposition
(Fine and Scheinberg, 2001; Bach and Jordan, 2005). The matrices $X$ and $Y$ provide
a representation of the patterns in two orthonormal bases which we denote by $(u_1, \ldots, u_{m_\mathcal{X}})$
and $(v_1, \ldots, v_{m_\mathcal{Y}})$. In particular we have, for any $i, j \in \{1, \ldots, N\}$:

$$x_i \otimes y_j = \sum_{l=1}^{m_\mathcal{X}} \sum_{m=1}^{m_\mathcal{Y}} X_{il} Y_{jm} \, u_l \otimes v_m ,$$

from which we deduce:

$$F = \sum_{l=1}^{m_\mathcal{X}} \sum_{m=1}^{m_\mathcal{Y}} \left( \sum_{i=1}^{N} \sum_{j=1}^{N} \gamma_{ij} X_{il} Y_{jm} \right) u_l \otimes v_m .$$

Comparing this expression to (16) we deduce that:

$$\alpha = X^\top \gamma Y .$$

The empirical error $R_N(F)$ is a function of $f(x_l, y_l)$ for $l = 1, \ldots, N$. From (27), we see
that:

$$f(x_l, y_l) = \sum_{i=1}^{N} \sum_{j=1}^{N} \gamma_{ij} K_{il} G_{jl} ,$$

and therefore the vector of predictions $F_N = (f(x_l, y_l))_{l=1,\ldots,N} \in \mathbb{R}^N$ can be rewritten as:

$$F_N = \mathrm{diag}(K \gamma G) = \mathrm{diag}(X \alpha Y^\top) .$$

We can now replace the empirical risk $R_N(F)$ by $R_N\left(\mathrm{diag}(X \alpha Y^\top)\right)$ and the penalty
$\Omega(F)$ by $\Omega(\alpha)$ to deduce the optimization problem (17) from (15), which concludes the
proof of Theorem 3.
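
The change of coordinates used in this last step can be verified numerically. In the sketch below, the rows of `X` and `Y` are taken to be the coordinates of the patterns in the two orthonormal bases; all values are random and serve only as an illustration.

```python
import numpy as np

rng = np.random.default_rng(11)
N, mX, mY = 6, 3, 2
X = rng.standard_normal((N, mX))        # row i: coordinates of x_i in (u_1, ..., u_mX)
Y = rng.standard_normal((N, mY))        # row j: coordinates of y_j in (v_1, ..., v_mY)
K, G = X @ X.T, Y @ Y.T                 # Gram matrices of the training patterns
gamma = rng.standard_normal((N, N))     # coefficients of F over the x_i (x) y_j, as in (27)

alpha = X.T @ gamma @ Y                 # coefficients of F over the u_l (x) v_m, as in (16)

# F assembled term by term from (27), expressed in the orthonormal bases.
F = sum(gamma[i, j] * np.outer(X[i], Y[j]) for i in range(N) for j in range(N))

pred_gram = np.diag(K @ gamma @ G)                      # diag(K gamma G)
pred_alpha = np.einsum("il,lm,im->i", X, alpha, Y)      # diag(X alpha Y^T)
pred_direct = np.array([X[l] @ F @ Y[l] for l in range(N)])
```

The three ways of computing the predictions agree, and the matrix of `F` in the orthonormal bases is exactly `alpha`.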



Appendix C. Proof of Proposition 5

Since the function has compact level sets, we may assume that we are restricted to an open
bounded subset of $\mathbb{R}^{p \times q}$ where the second and first derivatives are uniformly bounded. We
let $C > 0$ denote a common upper bound of all derivatives. The gradient of the function $H$
is equal to $\nabla H = \left(\nabla G \, V, \ \nabla G^\top U\right)$, while the Hessian of $H$ is the following quadratic form:

$$\nabla^2 H[(dU, dV), (dU, dV)] = 2 \, \mathrm{tr}\left(dV^\top \nabla G^\top dU\right) + \nabla^2 G\left[U dV^\top + dU V^\top, \ U dV^\top + dU V^\top\right].$$

Without loss of generality, we may assume that the last columns of $U$ and $V$ are equal to
zero (this can be done by rotation of $U$ or $V$). The zero gradient assumption implies that
$\nabla G^\top U = 0$ and $\nabla G \, V = 0$. If we take $dU$ and $dV$ with the first $m - 1$ columns equal
to zero, and last columns equal to arbitrary $u$ and $v$, then the second term in the Hessian
is equal to zero. The positivity of the first term implies that for all $u$ and $v$, $v^\top \nabla G^\top u \geq 0$,
i.e., the gradient of $G$ at $N = UV^\top$ is equal to zero, and thus we get a stationary point and
thus a global minimum of $G$.
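
The gradient formula for $H$ used in this proof can be checked by finite differences. The sketch assumes the toy objective $G(W) = \frac{1}{2}\|W - M\|_F^2$ with a random `M`; any other smooth $G$ with a known gradient could be substituted.

```python
import numpy as np

rng = np.random.default_rng(12)
p, q, m = 5, 4, 2
M = rng.standard_normal((p, q))


def G(W):
    """Illustrative smooth convex objective."""
    return 0.5 * np.sum((W - M) ** 2)


def grad_G(W):
    return W - M


def H(U_, V_):
    return G(U_ @ V_.T)


U, V = rng.standard_normal((p, m)), rng.standard_normal((q, m))
gU = grad_G(U @ V.T) @ V                # gradient of H with respect to U
gV = grad_G(U @ V.T).T @ U              # gradient of H with respect to V

# Central finite-difference check along a random direction (dU, dV).
dU, dV = rng.standard_normal((p, m)), rng.standard_normal((q, m))
h = 1e-6
fd = (H(U + h * dU, V + h * dV) - H(U - h * dU, V - h * dV)) / (2 * h)
analytic = float(np.sum(gU * dU) + np.sum(gV * dV))
```

The directional derivative obtained from the two blocks $\nabla G \, V$ and $\nabla G^\top U$ matches the finite-difference estimate.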